Intro

The dataset is given on the context that this is a pre-crawled dataset, taken as subset of a bigger dataset that was created by extracting data from Naukri.com, a leading job board.

sadly it is not known how the data were subseted, whether analysising this data would give a breif overview with regards to job market over a period of time or certain industry.

libraries

require(tidyverse)
require(data.table)
require(lubridate)
require(stringr)
require(ggvis)
require(ggplot2)
require(forcats)
require(ggmap)
require(highcharter)
require(broom)
require(plotly)
# setwd(dir="D:\\Users\\gdrtmh\\Desktop\\Kaggle\\Jobs on Naukri")
inputdt <- as.data.table(read.csv(file="naukri_com-job_sample.csv", strip.white = TRUE, 
                             na.strings=c("","NA","Not Mentioned","Not Disclosed by Recruiter"), 
                             stringsAsFactors = FALSE))

# setting "Not mentioned" as NA for experience etc.
# setting "Not Disclosed by Recruiter" that is the norm for NA in payrate

Data Exploration & Basic Analysis

This section seems to clean and organise the input data, while also investigate its basic quality/distribution.

glimpse(inputdt)
## Observations: 22,000
## Variables: 14
## $ company             <chr> "MM Media Pvt Ltd", "find live infotech", ...
## $ education           <chr> "UG: B.Tech/B.E. - Any Specialization PG:A...
## $ experience          <chr> "0 - 1 yrs", "0 - 0 yrs", "4 - 8 yrs", "11...
## $ industry            <chr> "Media / Entertainment / Internet", "Adver...
## $ jobdescription      <chr> "Job Description   Send me Jobs like this...
## $ jobid               <dbl> 210516002263, 210516002391, 101016900534, ...
## $ joblocation_address <chr> "Chennai", "Chennai", "Bengaluru", "Mumbai...
## $ jobtitle            <chr> "Walkin Data Entry Operator (night Shift)"...
## $ numberofpositions   <int> NA, 60, NA, NA, 4, NA, 2, 20, 2, NA, NA, N...
## $ payrate             <chr> "1,50,000 - 2,25,000 P.A", "1,50,000 - 2,5...
## $ postdate            <chr> "2016-05-21 19:30:00 +0000", "2016-05-21 1...
## $ site_name           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ skills              <chr> "ITES", "Marketing", "IT Software - Applic...
## $ uniq_id             <chr> "43b19632647068535437c774b6ca6cf8", "d4c72...

site_name

summary(as.factor(inputdt$site_name))
## www.naukri.com           NA's 
##           3987          18013
inputdt$site_name <- NULL

Well, it seems like the site name is pretty filled with either naukri or NA data. This column could probably be removed without influencing the subsequent data analysis.

uniq_id

uniqueN(inputdt$uniq_id)
## [1] 22000

This Would be useful as an unique identifier for individual job listing.

Jobid

uniqueN(inputdt, by=c("jobid"))
## [1] 21910

It is not exactly clear how “jobid” is structed, perhaps it accounts of relisting of jobs?

Although, since the number of unique jobid is only differing from uniq_id by 90, if the purpose where to depict the relisting of respective jobs, it probably did not do its job very well.

Payrate

# Seeing that payrate listing would corresponds to format of "1,50,000 - 2,50,000 P.A"
# I would just use gsub to remove strings after P.A
inputdt$payrate.M<-gsub( " P.A.*$", "", inputdt$payrate)
# this is essential to remove numerical characters after the P.A string
# this also removed certain listed incentive/commission information in payrate

# stripping all the non numerical chr from the payrate
inputdt$payrate.M<-gsub("[^0-9-]", "", inputdt$payrate.M)
# inputdt$m.payrate<-gsub("[^0-9,-,-]", "", inputdt$m.payrate)

# Outliers
# Example of some payrate that is not extracated properly ( with 2 "-" etc after regexp treatment), 
# seeing there is no easy solution and that there is only ~200 of them, them will be ignored for now
inputdt[lengths(strsplit(inputdt$payrate.M, "-"))>2, list(payrate,payrate.M),]
## # A tibble: 197 x 2
##                                                                        payrate
##                                                                          <chr>
##  1 Education- BDS, Be/b.tech [bio-chemistry, Bio-technology, Biomedical], Bams
##  2 PICK YOUR Job-tick YOUR Career-we Click IT FOR YOU! Yogam IS YOUR Career Co
##  3 Consolidated Pay : Rs.40,000-44,000-52,000- 56,000-60,000-64,000/- p.m. plu
##  4 Monthly Salary: Pay Scale of Rs.15600- 39100/- + Grade Pay Rs.6600/- + NPA.
##  5                                          Salary: RS.(52,000-5,000 - 77,000)
##  6 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.For direct recru
##  7                 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.
##  8 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m.For direct recru
##  9 Pay Scale: PB-4 (Rs.37400-67000) with AGP of Rs.9500/- p.m. For direct recr
## 10                 Pay Scale: PB-3 (Rs.I5600-39100) with AGP of Rs.8000/- p.m.
## # ... with 187 more rows, and 1 more variables: payrate.M <chr>

COuld potentially improve the regular expresssion by extracting only numbers between the dashes, but that would require more work with regeexp, going to put it off for now.

Splitting the payrate column while conditionally ignoring those that have more than dash (“-”) symbol.

# Identifying the appropriate number of splits for payrate with str "-"
splits <- max(lengths(strsplit(inputdt$payrate.M, "-")))

# Spliting the payrate into m.payrate1 and m.payrate2, Ignoring those column with additional splits
mdt <- inputdt[lengths(strsplit(inputdt$payrate.M, "-"))<=2,
               paste0("m.payrate", 1:2) := tstrsplit(payrate.M, "-", fixed=TRUE)][] 

# Changing the column into numerics
mdt$m.payrate1<- as.numeric(mdt$m.payrate1)
mdt$m.payrate2<- as.numeric(mdt$m.payrate2)

# account for case where there is only one listed payrate instead of a range
mdt<-mdt[is.na(m.payrate2), m.payrate2:=m.payrate1]

# Creating a column that indicate the mean of the payrate, would be easier on plotting etc
# na.rm = T to enable those listing that only depicted single number of payrate to have a mean
mdt[, m.payrate.Mean := rowMeans(.SD), by = uniq_id, .SDcols = c("m.payrate1", "m.payrate2")]

The mean value and median value of the payrate should almost be equivalent given that in this case, they are both determined from the listed upper and lower window of payrate.

# Checking if the values in experience1 are always lower than experience2, 
# such that experience1 can be considered as a lower limit for the job, vice versa for experience2
# summary(mdt$m.payrate1>mdt$m.payrate2)
#    Mode   FALSE    TRUE    NA's 
# logical       2    4743   17255 

# listing the data with m.payrate1 > m.payrate2
mdt[m.payrate1>m.payrate2,list(payrate,payrate.M,m.payrate1,m.payrate2)]
## # A tibble: 2 x 4
##                                     payrate payrate.M m.payrate1
##                                       <chr>     <chr>      <dbl>
## 1 Max.package15.5 Lpa... 25-30% hike on CTC  15525-30      15525
## 2              55,00,000 - 1 Cr & above P.A 5500000-1    5500000
## # ... with 1 more variables: m.payrate2 <dbl>

In both of these cases, the payrate is not well formated and extracted. There is probably some bad extraction in the extracting of payrate that is not depicted here as well as they may be hidden behind m.payrate2 > m.payrate1.

Removing these cases (where m.payrate1>m.payrate2) by setting them to NA.

mdt[(m.payrate1>m.payrate2),`:=`(payrate.M= NA,
                                 m.payrate1= NA,
                                 m.payrate2= NA,
                                 m.payrate.Mean= NA)]


# The majority of the records has no payrate listed.
# summary(mdt$m.payrate.Mean)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
# 9.000e+00 2.500e+05 3.750e+05 3.672e+09 8.000e+05 1.740e+13     17255 
# The maximum figure also points to likely error in extracation

# after attempting a few threshold of cut, I find 9e+6 to be most suitable, listed below are a table depicted payrate after this threshold.
mdt[m.payrate.Mean>9e+6] %>%
  .[,list(payrate, m.payrate.Mean)]
## # A tibble: 359 x 2
##                                                                        payrate
##                                                                          <chr>
##  1 Salary structure is: 32,000 INR (for thesis submitted), 36,000 INR (PhD), 3
##  2                                                              10000 to 45000
##  3                        Pay Scale: INR Rs.37400-67000 + Grade Pay Rs.10000/-
##  4 Pay : CDA Pay Scale : 9300 - 34800 + GP 4800 Gross pay including allowances
##  5      Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
##  6      Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
##  7      Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
##  8     Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.10,500/p.m.
##  9      Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## 10      Pay Band: PB4 (Rs.3740067000) with Academic Grade Pay of Rs.9,500/p.m.
## # ... with 349 more rows, and 1 more variables: m.payrate.Mean <dbl>

In fact, it almost seems like the posting of payrate = “Pay Band: PB3 (Rs.15600-39100) with Academic Grade Pay of Rs.8,000/p. m.” are just reposting seeing about 80 listing of jobs are having identifical payrate.

# for an easy way out, it seems just removing all extracted payrate data for these badly extracted payrate will do. ie:
mdt[m.payrate.Mean>9e+6,`:=`(payrate.M= NA,
                             m.payrate1= NA,
                             m.payrate2= NA,
                             m.payrate.Mean= NA)]


#total count of valid payrate records
N_valid_payrate<-sum(!is.na(mdt$m.payrate.Mean))

payrate_record_percentage <- sum(!is.na(mdt$m.payrate.Mean))/sum((mdt$m.payrate.Mean), na.rm=T)

Sadly, there is only payrate_record_percentage of the job listing that has its payrate listed and extracted so fr, that would be a mere N_valid_payrate records.

# all_values <- function(x) {
#   if(is.null(x)) return(NULL)
#   paste0(names(x), ": ", format(x), collapse = "<br />")
# }

mdt[!is.na(m.payrate.Mean),][m.payrate.Mean<9e+6] %>%
  melt(.,  measure.vars = patterns("^m.payrate")) %>%
  ggvis(~value, fill = ~variable) %>%
  group_by(variable) %>%
  layer_densities() %>%
  add_axis("x", title = "Payrate Per Annum") %>%
  add_axis("y", title = "Density")
# Oddly I couldn't find documented (and not dirty) approach in inserting a plot title with ggvis. I will just leave it as it is for now.

Would be interesting to see the how the payrate woudl vary per industry/skill/location in a later section. Though, if we narrow down our windowing.

mdt[!is.na(m.payrate.Mean),][m.payrate.Mean<2e+6] %>%
  melt(.,  measure.vars = patterns("^m.payrate")) %>%
  ggvis(~value, fill = ~variable) %>%
  group_by(variable) %>%
  layer_densities() %>%
  add_axis("x", title = "Payrate Per Annum") %>%
  add_axis("y", title = "Density")

Well it seems that for the payrate of the listed jobs, the median of listed salary is just about ~300k repee PA ( since the jobboard seems to be based in india).

Education

head(mdt$education,5)
## [1] "UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required"                                                             
## [2] "UG: B.Tech/B.E. - Any Specialization PG:MBA/PGDM - Any Specialization"                                                                                                   
## [3] "UG: Any Graduate - Any Specialization PG:Any Postgraduate Doctorate:Doctorate Not Required"                                                                              
## [4] "UG: Any Graduate - Any Specialization PG:CA Doctorate:Doctorate Not Required"                                                                                            
## [5] "UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, MCA - Computers, M.Tech - Any Specialization Doctorate:Any Doctorate - Any Specialization"

It can be seen ( after skimming through multiple lines)that Education can be split into different field quite accurately with delimiter of “UG:”, “PG:” and “Doctorate:”. Also, in some case, the education field can be listed, yet are noted as “Not Required”

mdt %>%
  .[grep("Diploma",education),m.Education.tf.Diploma := TRUE] %>% #Diploma
  .[,m.Education.UG :=trimws(str_extract(education, "(?<=UG: ).*(?=PG:)|(?<=UG: ).*$"))] %>%
  .[grep("Not Required",m.Education.UG),m.Education.UG :=NA] %>% #Removing "not required"
  .[!is.na(m.Education.UG),m.Education.tf.UG:=TRUE] %>%
  .[,m.Education.PG :=trimws(str_extract(education, "(?<=PG:).*(?=Doctorate:)|(?<=PG:).*$"))] %>%
  .[grep("MBA",m.Education.PG),m.Education.tf.MBA := TRUE] %>% #MBA
  .[grep("Not Required",m.Education.PG),m.Education.PG :=NA] %>%
  .[!is.na(m.Education.PG),m.Education.tf.PG:=TRUE] %>%
  .[,m.Education.DR :=trimws(str_extract(education, "(?<=Doctorate:).*$"))] %>%
  .[grep("Not Required",m.Education.DR),m.Education.DR :=NA] %>%
  .[!is.na(m.Education.DR),m.Education.tf.DR:=TRUE]

Note that I also created some “m.Education.tf.x” (where x= UG, Diploma, PG, MBA, DR ) to indicate the need for designated education level in True or NA.

In some case, such as

mdt[(33:35)]$m.Education.UG
## [1] "Any Graduate - Any Specialization, B.Sc - Any Specialization, Maths, Physics, B.Com - Commerce"
## [2] "Any Graduate - Any Specialization"                                                             
## [3] "B.Tech/B.E. - Bio-Chemistry/Bio-Technology, Biomedical, Chemical, B.Pharma - Pharmacy"

Additional split on “,” will allow more details to be extracted, but I will left this alone for now. Going further would probably require diving deep into NLP or pulling a dictionary (to account for shortforms and synonyms).

Number of job listing for a particular level of Education

mdt %>%
  .[,lapply(.SD, sum, na.rm=TRUE) ,.SDcols= (names(mdt) %like% "^m.Education.tf*")]
## # A tibble: 1 x 5
##   m.Education.tf.Diploma m.Education.tf.UG m.Education.tf.MBA
##                    <int>             <int>              <int>
## 1                    669             17071               3182
## # ... with 2 more variables: m.Education.tf.PG <int>,
## #   m.Education.tf.DR <int>

Education Factor

Note that the number of job listing table on top allowed mulitple entries of job for each different ncategories of education. For example, A job that listed education requirement as UG: and PG: would be counted as 1 in both categories.

The following codes created a section that would only attach the highest recorded education for everyt individual job listing. Ie, if the job listed for “UG: B.Sc, PG: Masters”, it would be recorded as Master level job in this column. I assume the level of education follow a nested/linear path that corresponds to: Diploma > UG > MBA > PG > DR. The position of MBA is perhaps slightly controversial.

# This list of data table chain simply utilise grep(str_detect) and is.na( the existance of UG, PG, DR) in previous education columns to assign a unique factor to the column

mdt %>%
  .[grep("Diploma",education),m.Education.factor := "Diploma"] %>% 
  .[!is.na(m.Education.UG),m.Education.factor := "UG"] %>%
  .[grep("MBA",education),m.Education.factor := "MBA"] %>%
  .[!is.na(m.Education.PG),m.Education.factor := "PG"] %>%
  .[!is.na(m.Education.DR),m.Education.factor := "DR"]
  
mdt$m.Education.factor <- factor(mdt$m.Education.factor)


# There outough to be better methods to melt, extract the highest education requirement information but I failed to extracted/melt it cleanly from the previous extracted information.

Skills

#head of records in skill
head(mdt$skills)
## [1] "ITES"                                 
## [2] "Marketing"                            
## [3] "IT Software - Application Programming"
## [4] "Accounts"                             
## [5] "IT Software - Application Programming"
## [6] "IT Software - Application Programming"
# unique(inputdt$skills)

listing of top 10 skill in demand

mdt[!is.na(skills),.N,by=list(skills)] %>%
  .[,.SD[order(-N)]] %>%
  .[, head(.SD, 10),] %>%
  .[,value:=fct_inorder(factor(skills))] %>% # fct_infreq from "forcats" to reoder the factor in x axis
  ggvis(x=~skills, y=~N) %>%
  layer_bars(fillOpacity := 0.1) %>%
  add_axis("x", properties = axis_props(
    labels = list(angle = 45, align = "left", fontSize = 10)
  ))
## Warning: package 'bindrcpp' was built under R version 3.3.3

This shows that the most demanding skills are IT Software - Application Programming, followed by Sales and ITES.

 mdt$skills <- as.factor(mdt$skills)

Industry

# Checking whether every val  id entry has "yrs" in the records, 
# summary(!is.na(mdt$industry))
#    Mode   FALSE    TRUE    NA's 
# logical       5   21995       0 
# Majority of the job listing did listed valid industry

head(mdt$industry)
## [1] "Media / Entertainment / Internet"        
## [2] "Advertising / PR / MR / Event Management"
## [3] "IT-Software / Software Services"         
## [4] "Banking / Financial Services / Broking"  
## [5] "IT-Software / Software Services"         
## [6] "IT-Software / Software Services"
# since there is limited number of combination of industry, this column should be set as a factor
mdt$industry <- as.factor(mdt$industry)

It seems most job listing are well arranged, it is almost listed in tree like structure that.

# # splitting the mdt$industry with "/" into industryS1, industryS2 etc
# splits <- max(lengths(strsplit(mdt$industry, "/")))
# mdt <- mdt[,paste0("industryS", 1:splits) := tstrsplit(industry, " / ", fixed=TRUE)][]
# 
# 
# mdt[, industry_combined := paste(sort(.SD), collapse = " / "), by = 1:nrow(mdt),
#    .SDcols = names(mdt) %like% "^industryS"]
# head(mdt$industry_combined)
# 
# mdt[, .N, by= industry_combined] %>%
#   .[,.SD[order(-N)]]

mdt[, .N, by= industry] %>%
  .[,.SD[order(-N)]]
## # A tibble: 64 x 2
##                                         industry     N
##                                           <fctr> <int>
##  1               IT-Software / Software Services  9216
##  2               Education / Teaching / Training  1322
##  3                      BPO / Call Centre / ITES  1254
##  4        Banking / Financial Services / Broking  1238
##  5                        Recruitment / Staffing  1129
##  6                          Internet / Ecommerce   738
##  7          Pharma / Biotech / Clinical Research   525
##  8              Medical / Healthcare / Hospitals   495
##  9 Automobile / Auto Anciliary / Auto Components   478
## 10  Construction / Engineering / Cement / Metals   449
## # ... with 54 more rows

Experience

# dt[!str_detect(dt$experience, "yrs")]# It seems all the listing are in a matter of years

# one entry with  double dash "-"
# glimpse(dt[max(lengths(strsplit(dt$experience, "-")))==lengths(strsplit(dt$experience, "-"))])
mdt[max(lengths(strsplit(mdt$experience, "-")))==lengths(strsplit(mdt$experience, "-"))]$experience <- "1 - 3 yrs"

# Removing the last 3 char - "yrs" in the experience records
# nchar(mdt$experience, allowNA = TRUE)-3
mdt$experience.M <- str_sub(mdt$experience, 1,nchar(mdt$experience, allowNA = TRUE)-4)

# Splitting the experience
splits <- max(lengths(strsplit(mdt$experience.M, "-")))
mdt <- mdt[,paste0("experience", 1:splits) := tstrsplit(experience.M, "-", fixed=TRUE)][]

# Changing them to appropriate class
mdt$experience1<-as.numeric(mdt$experience1)
mdt$experience2<-as.numeric(mdt$experience2)

# Create a columm for median of experience
mdt[, m.experience.Median:=median(c(experience1, experience2)), by = uniq_id]


# Checking if the values in experience1 are always lower than experience2, 
# such that experience1 can be considered as a lower limit for the job, vice versa for experience2
# summary(mdt$experience1<=mdt$experience2)
#    Mode    TRUE    NA's 
# logical   21885     115 

# renaming the experience1 and experience2 column into more obvious form
setnames(mdt, c("experience1","experience2"), c("m.experience.L", "m.experience.U"))


mdt[!is.na(m.experience.L),]%>%
  melt(.,  measure.vars = patterns("^m.experience")) %>%
  ggvis(~value, fill = ~variable) %>%
  group_by(variable) %>%
  layer_densities(adjust = 2) %>%  #adjust parameter to smooth the wiggliness of density
  add_axis("x", title = "Number of years") %>%
  add_axis("y", title = "Density")

Aggregate/Dice Experience to larger groups

mdt<-mdt[, m.experience.Cut := cut(m.experience.Median, 
                                   breaks=c(-Inf, 3*(0:3), 10+4*(0:2), Inf), right=F)]
mdt[is.na(m.experience.Cut)]$m.experience.Cut <- "Not listed"
# By default NA  ( those not in the range of the cut are outside our boundaries and not counted. Started the break with -Inf to include them in the calculation

Number of positions

class(mdt$numberofpositions)
## [1] "integer"
summary(mdt$numberofpositions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    4.00   45.12   10.00 2000.00   17536

Intestingly, there are job posting listed to have 2000 positions available, a quick look at this outlier.

mdt[numberofpositions==2000]
## # A tibble: 2 x 31
##                    company
##                      <chr>
## 1 HCL Technologies Limited
## 2 HCL Technologies Limited
## # ... with 30 more variables: education <chr>, experience <chr>,
## #   industry <fctr>, jobdescription <chr>, jobid <dbl>,
## #   joblocation_address <chr>, jobtitle <chr>, numberofpositions <int>,
## #   payrate <chr>, postdate <chr>, skills <fctr>, uniq_id <chr>,
## #   payrate.M <chr>, m.payrate1 <dbl>, m.payrate2 <dbl>,
## #   m.payrate.Mean <dbl>, m.Education.tf.Diploma <lgl>,
## #   m.Education.UG <chr>, m.Education.tf.UG <lgl>, m.Education.PG <chr>,
## #   m.Education.tf.MBA <lgl>, m.Education.tf.PG <lgl>,
## #   m.Education.DR <chr>, m.Education.tf.DR <lgl>,
## #   m.Education.factor <fctr>, experience.M <chr>, m.experience.L <dbl>,
## #   m.experience.U <dbl>, m.experience.Median <dbl>,
## #   m.experience.Cut <fctr>

The majority of the job listing doesn’t list a number of the number of positions, I think it is reasonable to assump those without listed number in ‘Number of position’ as 1. This may not be entirely accurate but it will manipulation with other data in later section.

mdt<-mdt%>%
  .[, m.Npositions:=numberofpositions] %>%
  .[is.na(numberofpositions), m.Npositions:=1]

summary(mdt$m.Npositions)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    9.952    1.000 2000.000

Now all recorded number of positions have a valid number.

Date Time

split and time zone

head(mdt$postdate)
## [1] "2016-05-21 19:30:00 +0000" "2016-05-21 19:30:00 +0000"
## [3] "2016-10-13 16:20:55 +0000" "2016-10-13 16:20:55 +0000"
## [5] "2016-10-13 16:20:55 +0000" "2016-10-13 16:20:55 +0000"
#str split the postdate based on format corresponds to "2016-05-21 19:30:00 +0000" on "+"
mdt[, c("postdate_time","postdate_timezone") := tstrsplit(postdate,"+",2)]

# summary(as.factor(mdt$postdate_timezone))
# 0000  NA's 
# 21977    23 
# it seems timezone is pretty meaningless in the data base too

# Removing timezone column
mdt$postdate_timezone <- NULL

formatting

mdt$postdate_time <- parse_date_time(mdt$postdate_time, "%Y-%m-%d H:M:S")
#ignoring timezone for now, shouldn't make any difference too

min(mdt$postdate_time, na.rm =T)
## [1] "2015-01-24 19:48:31 UTC"
max(mdt$postdate_time, na.rm =T)
## [1] "2017-01-11 21:00:00 UTC"

Number of job posting over time

#
mdt[, m.timeCut := cut(as.POSIXct(postdate_time), breaks="1 month")]

mdt[!is.na(m.timeCut), .N, by=m.timeCut] %>%
  ggvis(~m.timeCut, ~N) %>%
  layer_bars(fillOpacity := 0.1) %>%
  add_axis("x", properties = axis_props(
    labels = list(angle = 45, align = "left", fontSize = 10)
  ))
# alternative data table method
# mdt[, .N, by=month(postdate_time)] 

# alternate dplyr solution
# mdt$uniq_id
# mdt %>% 
#   group_by(yr = year(postdate_time), mon = month(postdate_time)) %>% 
#   summarise(mn_amt = length(uniq_id))

If we assume that this captured all the job listing over 2016 to 2017 (probably unlikely given the huge variance over the period and that there are months where there is little to no listing), it would seems that most jobs are posted in March in 2016 ( Start of year-ish), followed by Nov 2015 (End of year). I wonder if this coincide with the timing of graduation/bonus etc.

Location

Finally we are at location, I have to admit that the reason I went ahead with ggvis is predominaly because of their map package.

head(mdt$joblocation_address,5)
## [1] "Chennai"                                                   
## [2] "Chennai"                                                   
## [3] "Bengaluru"                                                 
## [4] "Mumbai, Bengaluru, Kolkata, Chennai, Coimbatore, Tamilnadu"
## [5] "Bengaluru"

Some job listing are listed at multiple location, such as “Delhi NCR, Mumbai, Bengaluru, Kochi, Greater Noida, Gurgaon, Hyderabad, Kozhikode, Lucknow”. Hence, a strsplit with arguements for “,”, “/”, should do the job.

Highchater: Hcmap

Via join with mapdata

mapdata <- get_data_from_map(download_map_data("countries/in/in-all"))
# glimpse(mapdata)
# Observations: 34
# Variables: 20
# $ hc-group    <chr> "admin1", "admin1", "admin1", "admin1", "admin1", "admin1", "admin1", "adm...
# $ hc-middle-x <dbl> 0.65, 0.59, 0.50, 0.56, 0.46, 0.46, 0.51, 0.59, 0.47, 0.56, 0.60, 0.63, 0....
# $ hc-middle-y <dbl> 0.81, 0.63, 0.74, 0.38, 0.64, 0.51, 0.34, 0.41, 0.60, 0.32, 0.47, 0.55, 0....
# $ hc-key      <chr> "in-py", "in-ld", "in-wb", "in-or", "in-br", "in-sk", "in-ct", "in-tn", "i...
# $ hc-a2       <chr> "PY", "LD", "WB", "OR", "BR", "SK", "CT", "TN", "MP", "GU", "GA", "NL", "M...
# $ labelrank   <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",...
# $ hasc        <chr> "IN.PY", "IN.LD", "IN.WB", "IN.OR", "IN.BR", "IN.SK", "IN.CT", "IN.", "IN....
# $ alt-name    <chr> "Pondicherry|Puduchcheri|Pondichéry", "Ã\u008dles Laquedives|Laccadive|Mi...
# $ woe-id      <chr> "20070459", "2345748", "2345761", "2345755", "2345742", "2345762", "200704...
# $ fips        <chr> "IN22", "IN14", "IN28", "IN21", "IN34", "IN29", "IN37", "IN22", "IN35", "I...
# $ postal-code <chr> "PY", "LD", "WB", "OR", "BR", "SK", "CT", "TN", "MP", NA, "GA", "NL", "MN"...
# $ name        <chr> "Puducherry", "Lakshadweep", "West Bengal", "Orissa", "Bihar", "Sikkim", "...
# $ country     <chr> "India", "India", "India", "India", "India", "India", "India", "India", "I...
# $ type-en     <chr> "Union Territory", "Union Territory", "State", "State", "State", "State", ...
# $ region      <chr> "South", "South", "East", "East", "East", "East", "Central", "South", "Cen...
# $ longitude   <chr> "79.7758", "72.7811", "87.7289", "84.4341", "85.8134", "88.4482", "82.3069...
# $ woe-name    <chr> "Puducherry", "Lakshadweep", "West Bengal", "Orissa", "Bihar", "Sikkim", "...
# $ latitude    <chr> "10.9224", "11.2249", "23.0523", "20.625", "25.6853", "27.5709", "21.8044"...
# $ woe-label   <chr> "Puducherry, IN, India", "Lakshadweep, IN, India", "West Bengal, IN, India...
# $ type        <chr> "Union Territor", "Union Territor", "State", "State", "State", "State", "S...
 
DT1 <- setDT(mydat)[, c("lon", "lat"), with = FALSE] %>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30] %>%
  unique(., by = c('lon','lat')) 

DT2 <- setDT(mapdata)[, c("longitude", "latitude", "hc-a2"), with = FALSE] %>%
  .[, c("longitude","latitude"):=lapply(.SD, as.numeric), .SDcol=c("longitude","latitude")]
colnames(DT2) <- c("lon","lat", "hca2")

# checking the uniqueness
# DT1[,.N,c("lon","lat")][N>1L]
# DT2[,.N,c("longitude","latitude")][N>1L]

# Turns out joining on nearest value based on two column isn't trivial,  
# I have adapted a fuction from stackoverflow
# https://stackoverflow.com/questions/28435126/merge-data-table-by-two-nearest-variables?noredirect=1&lq=1
func = function(u,v)
{
    vec = with(DT2, (u-lon)^2 + (v-lat)^2)
    DT2[which.min(vec),]$hca2
}

t.join.data <- transform(DT1, 'hc-a2'=apply(DT1, 1, function(u) func(u[[1]], u[[2]]))) %>%
  .[mydat, on=.(lon,lat)] %>%
  .[!is.na('hc-a2'), value:=sum(N), by="hc-a2"] %>%
  unique(., by = c('hc-a2')) %>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30]

# hcmap("countries/in/in-all", data = t.join.data, value = "value",
#       joinBy = c("hc-a2", "hc-a2"), name = "Number of job listing",
#       dataLabels = list(enabled = TRUE, format = '{point.name}'), borderWidth = 1,
#       tooltip = list(valueDecimals = 0, valuePrefix = "", valueSuffix = "")) 

Sadly I dont see an easy way to join the mapdata inforamtion together with the job listing information to provide state/region level information. i guess the likes of Mumbai, Bengaluru, Bangalore etc are actually cities name.

I resort to plotting them with the geocoded latitude and longitude with highchart, which depicted similar data to the ggmap in previous section.

 # Setting z,color and name parameter for plotting with highcharter 
 highcharter.plot <- setDT(mydat)[lon>60&lon<90] %>%
   .[lat>5&lat<30]%>%
   .[,`:=`(
           name=m.i.location,
           z=N,
           color= colorize(N)
           )] 
 
# plotting with highcharter
hcmap("countries/in/in-all", name = "India") %>% 
  hc_add_series(data = highcharter.plot, type = "mapbubble",
                name = "Job listing", maxSize = "35") %>% 
    hc_title(text = "Locations of Job Posting") %>% 
  hc_subtitle(text = "only top 100 locations are included, the counts also include duplicates if the job is listed at multiple location") %>%
  hc_mapNavigation(enabled = TRUE) 

Job description

Output of jobs description column

head(mdt$jobdescription, 1)
## [1] "Job Description   Send me Jobs like this Qualifications: - == > 10th To Graduation & Any Skill: - == > Basic Computer Knowledge Job Requirement : - == > System or Laptop Type of job: - == > Full Time or Part time Languages : - == > Tamil & English. Experience : - == > Freshers & Experience payment details: - 1 form per day 5/- 10 form per day 50/- 100 form per day 500/- monthly you can earn 15000/- per month Selection Process: - == > Easy Selection Process,So What Are You Waiting For? Apply Now & Grab Best Opportunity To Make Your Carrier & To Improve Your Earing Skills. More detail contact Mr Hari 8678902528 9003010282 Salary:INR 1,50,000 - 2,25,000 P.A Industry: Media / Entertainment / Internet Functional Area: ITES , BPO , KPO , LPO , Customer Service , Operations Role Category:Other Role:Fresher Keyskills English Typing Part Time Data Entry Selection Process Desired Candidate Profile Education- UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required Please refer to the Job description above Company Profile: MM Media Pvt Ltd Mass Media International Find Live Info Pvt Ltd DATA SERVICES is established in the year 2012 with a vision of transforming an ordinary to extra ordinary thing. Find live info is one of our main project, it is a developing search engine site based at Chennai, India. With years of experience, we have crafted massive data collection job opportunities for public. FIND LIVE INFO a blend of experience, imagination, strategy and action to create digital experience that excite, challenge, inspire and engage users. FIND LIVE INFO provides other business opportunities which will be handled by experts of various fields. We have interesting combination of young and experienced Specialists Under the guidance of our Top Management we bloom towards success More detail contect 8678902528 Download PPT Photo 1   View Contact Details"

It seems this column data is not well formated. Some of the useful information such as payrate and indsutry is also already extracted into individual column by pre-crawler. Still, it would be interesting to see if we can indentify additional useful information for job description with some basic NLP processes.

It seems every job description begins with “Job Description  Send me Jobs like this”. This particular string will be removed before anyany processes.

## Removing the "Job Description   Send me Jobs like this" 
mdt$jobdescription <- str_replace(mdt$jobdescription, "Job Description", "") %>%
  str_replace(., "Send me Jobs like this", "")

# Checking the output
head(mdt$jobdescription, 1)
## [1] "    Qualifications: - == > 10th To Graduation & Any Skill: - == > Basic Computer Knowledge Job Requirement : - == > System or Laptop Type of job: - == > Full Time or Part time Languages : - == > Tamil & English. Experience : - == > Freshers & Experience payment details: - 1 form per day 5/- 10 form per day 50/- 100 form per day 500/- monthly you can earn 15000/- per month Selection Process: - == > Easy Selection Process,So What Are You Waiting For? Apply Now & Grab Best Opportunity To Make Your Carrier & To Improve Your Earing Skills. More detail contact Mr Hari 8678902528 9003010282 Salary:INR 1,50,000 - 2,25,000 P.A Industry: Media / Entertainment / Internet Functional Area: ITES , BPO , KPO , LPO , Customer Service , Operations Role Category:Other Role:Fresher Keyskills English Typing Part Time Data Entry Selection Process Desired Candidate Profile Education- UG: B.Tech/B.E. - Any Specialization PG:Any Postgraduate - Any Specialization, Post Graduation Not Required Please refer to the Job description above Company Profile: MM Media Pvt Ltd Mass Media International Find Live Info Pvt Ltd DATA SERVICES is established in the year 2012 with a vision of transforming an ordinary to extra ordinary thing. Find live info is one of our main project, it is a developing search engine site based at Chennai, India. With years of experience, we have crafted massive data collection job opportunities for public. FIND LIVE INFO a blend of experience, imagination, strategy and action to create digital experience that excite, challenge, inspire and engage users. FIND LIVE INFO provides other business opportunities which will be handled by experts of various fields. We have interesting combination of young and experienced Specialists Under the guidance of our Top Management we bloom towards success More detail contect 8678902528 Download PPT Photo 1   View Contact Details"
require(tm)
require(quanteda)
require(wordcloud)
require(slam) # for row_sums
require(RColorBrewer)
### create Document Feature Matrix 
dfm.txt <- corpus(mdt$jobdescription, docvars = data.frame(grp = mdt$industry)) %>%
  dfm(., groups = "grp", 
      stem = T, 
      remove=stopwords("english"),
      remove_symbols = T,
      remove_punct = T,
      remove_numbers = T,
      remove_url = T)

dfm.txt
## Document-feature matrix of: 64 documents, 57,132 features (94.7% sparse).

Comparison of wordcloud by u=industry

Note that this only consider the 1st/primary industry of the job records

## Summarizing and identifying the most popular 8th industry 
popular_industry_count <- mdt %>%
  .[!is.na(industry),.N,by=list(industry)] %>%
  .[,.SD[order(-N)]] %>%
  .[, head(.SD, 8),]

# popular_industry_count$industryS1
# tfidf(wikiDfm)

textplot_wordcloud(dfm_select(dfm.txt, documents=popular_industry_count$industry[1:4]), 
                   comparison = T, rot.per = .25, scale=c(2,.55),title.size =1)

textplot_wordcloud(dfm_select(dfm.txt, documents=popular_industry_count$industry[5:8]), 
                   comparison = T, rot.per = .25, scale=c(2,.55),title.size =1)

Extracting Specific frequency of strings - Programming Langauges

## Making a dictionary list of common and popular string list of programming languages
myDict <- dictionary(list(git="git",
                          c = "c",
                          python = "python",
                          java="java",
                          cpp = "c++",
                          chash= "csharp",
                          r="r",
                          javascript="javascript",
                          php= "php",
                          go="go",
                          swift="swift"
                          ))
# the later two langauge might have issue identifying the correct frequency of the word as the names can also double as common word.

## Applying the dictionary just defined on the job description column.
ProgrammingL_popularity <- dfm(mdt$jobdescription, dictionary = myDict)

## Topfeatures() to extract popular languages and its counts
dt.ProgrammingL<-data.frame(Language = names(topfeatures(ProgrammingL_popularity)),
                            N=topfeatures(ProgrammingL_popularity))

## fct_inorder to reorder the langauge with the 
setDT(dt.ProgrammingL)[,Language:= fct_inorder(factor(Language))] %>%
  ggvis(~Language, ~N) %>% 
  layer_bars(fillOpacity := 0.1) %>%
  add_axis("x", title = "Programming Language",
           properties = axis_props(labels = list(angle = 45, align = "left", fontSize = 15))) %>%
  add_axis("y", title = "Number of instance Mentioned")
## To extract frequency of words
# freq <- textstat_frequency(ProgrammingL_popularity)
# "textstat_frequency" is only available into quentda 0.99, which doesn't seem to be available by default at the moment

Strangely, Java is the most popular languages, it was mentioned 5000 times in job description. It is follower by c and javascript, though they are only half as popular. R is also rather popular, even somehow being more popular than python too.

Cross Data Analysis

# Preserving only extracted data/useful data in the dt
mdt <- mdt %>%
  .[,(colnames(.) %like% "^m."| colnames(.)=="industry"| colnames(.)=="uniq_id"), 
             with=FALSE]

# Another modified data table with with multiple listing of locations as multiple entries
mdt.melt.location <- mdt.melt.location %>%
  .[,(colnames(.) %like% "^m."| colnames(.)=="industry"| colnames(.)=="uniq_id"), with=FALSE]

Industry, Payrate, Experience

hcboxplot(x = mdt$m.payrate.Mean, 
          var = mdt$industry) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Plot of Payrate(Mean, PA) vs Industry") 

If we were to ignore the outliers and raw data (which would be heavily influenced by the amount of job listing in that particular industry), it seems that “Iron and Steel”, “Pulp and Paper”, “IT-Software” and “Strategy/Management consulting Firm” command the best payrate.

Although, one would be mistaken to conclude straight away that that “Pul and Paper” and the likes are the best industry (Pay wise) to work in, as this ignore other factors such as number of job listing/payrate recorded, as well other not specifed requirement.

The table below listed the Median Pay of listing by different industries.

mdt[!is.na(m.payrate.Mean),.(.N, 
                              Q1=quantile(m.payrate.Mean, 0.25),
                              PayMed=median(m.payrate.Mean),
                              Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
  .[,.SD[order(-PayMed)]] %>%
  .[, head(.SD, 30),] 
## # A tibble: 30 x 5
##                                  industry     N     Q1  PayMed      Q3
##                                    <fctr> <int>  <dbl>   <dbl>   <dbl>
##  1                         Pulp and Paper     2 784375 1231250 1678125
##  2 Strategy / Management Consulting Firms    35 487500 1000000 1475000
##  3                         Iron and Steel     6 362500  975000 1475000
##  4               Ceramics / Sanitary ware    13 500000  800000  850000
##  5                                  Tyres     1 750000  750000  750000
##  6                      Glass / Glassware     5 700000  725000 1100000
##  7                            Telecom/ISP    15 312500  687500 1137500
##  8                   Brewery / Distillery     3 487500  625000  912500
##  9               IT-Hardware & Networking    79 293750  600000 1025000
## 10     Water Treatment / Waste Management     4 468750  587500  675000
## # ... with 20 more rows

Evidently, “Pulp and Paper” current payrate data is derived from only two job listing, thus it’s pay scale is likely heavily skrewed; The abonormally high PayScale might be reserved for c-level employee. Also, we have zero clue with regards of how the data (which is a subset of the actual listing) scale when the full database are considered. Regardless, it is probably safe to assume that the current datasets are a random subset of the actual full datasets. Hence, having more listing of jobs in a particular industry would probably portray a clearer picture with regards to pay.

Thus, if we were to restrict our boxplot to industry with more than 30 listing.

t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]
  
  hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$industry) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Plot of Payrate(Mean, PA) vs Industry") 

The clear winners are “IT - Sofreware”, “Internet” and “Strategy/Management consulting Firm”.

In the form of table:

mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]%>%
  .[!is.na(m.payrate.Mean), .(.N, 
                              Q1=quantile(m.payrate.Mean, 0.25),
                              PayMed=median(m.payrate.Mean),
                              Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
  .[,.SD[order(-PayMed)]] %>%
  .[, head(.SD, 10),]
## # A tibble: 10 x 5
##                                         industry     N     Q1  PayMed
##                                           <fctr> <int>  <dbl>   <dbl>
##  1        Strategy / Management Consulting Firms    35 487500 1000000
##  2                      IT-Hardware & Networking    79 293750  600000
##  3               IT-Software / Software Services  1249 337500  550000
##  4             Textiles / Garments / Accessories    51 312500  550000
##  5                       FMCG / Foods / Beverage    82 265625  500000
##  6        Banking / Financial Services / Broking   279 300000  475000
##  7                          Accounting / Finance    74 265625  450000
##  8 Oil and Gas / Energy / Power / Infrastructure    34 275000  406250
##  9              Media / Entertainment / Internet    53 250000  400000
## 10                        Real Estate / Property    82 300000  400000
## # ... with 1 more variables: Q3 <dbl>
mdt[!is.na(m.payrate.Mean), if (.N > 30) .SD, by = industry]%>%
  .[!is.na(m.payrate.Mean), .(.N, 
                              Q1=quantile(m.payrate.Mean, 0.25),
                              PayMed=median(m.payrate.Mean),
                              Q3=quantile(m.payrate.Mean, 0.75)), by=industry] %>%
  .[,.SD[order(-PayMed)]] %>%
  .[, tail(.SD, 10),]
## # A tibble: 10 x 5
##                                               industry     N     Q1 PayMed
##                                                 <fctr> <int>  <dbl>  <dbl>
##  1                                  Retail / Wholesale    61 225000 350000
##  2    Courier / Transportation / Freight / Warehousing    39 168750 350000
##  3 Travel / Hotels / Restaurants / Airlines / Railways   128 212500 318750
##  4                                               Other    77 175000 312500
##  5               Industrial Products / Heavy Machinery    75 206250 300000
##  6                                           Insurance    86 250000 300000
##  7                    Medical / Healthcare / Hospitals   179 250000 300000
##  8                     Education / Teaching / Training   138 203125 300000
##  9                            BPO / Call Centre / ITES   741 212500 275000
## 10                              Recruitment / Staffing   115 162500 225000
## # ... with 1 more variables: Q3 <dbl>

On the end of tial, some of the industries’s pay are pathetic - both the median of the industry and the range ( Q3-Q1) is fairly minute. This probably shows both the starting pay and the latitude/margin to growth in these industry are limited. This is exemplify by “BPO/Call Centre”, “Recruitment/Staffing”, “Teaching/Training” and “Healthcare”.

Also, it is interesting to note that some of the low paying industry - “Call Centre”, “Education”, “Hospital” are industries that posted most job on the database for the duration of the data subset.

hcboxplot(x = mdt$m.payrate.Mean, 
          var = mdt$m.experience.Median) %>% 
  hc_chart(type = "column") %>% 
  hc_title(text = "Plot of Payrate(Mean, PA) vs experience (yrs)") 
filtered.IMP <- mdt[!is.na(m.experience.Median) & !is.na(m.payrate.Mean),
    .SD, .SDcols=c("m.experience.Median","m.payrate.Mean")]

tempt_ggplot <- filtered.IMP %>%
  ggplot(aes(x=factor(m.experience.Median), y=m.payrate.Mean)) +  
  geom_boxplot() +
  geom_smooth(method="lm", se=FALSE, aes(group=1)) +
    geom_text(x = 30, y = 4, label = sprintf("adj_r2: %s",
        signif(
          glance(lm(m.experience.Median~m.payrate.Mean, filtered.IMP))$adj.r.squared,
               3)), parse = TRUE)
ggplotly(tempt_ggplot) %>%
  layout(title = 'Payrate vs Experience',
         xaxis = list(title = 'Experience (years)[range]', tickangle = 90),
         yaxis = list(title = 'Annual Payrate'))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## archieve of using ggvis for plotting

# mdt[!is.na(m.payrate.Mean)] %>%
#   ggvis(~industryS1, ~m.payrate.Mean) %>%
#   layer_boxplots(fillOpacity := 0.1, width = 0.5) %>% 
#   add_axis("x", title = "Payrate PA Distribution",
#            properties = axis_props(labels = list(angle = 90, align = "left", fontSize = 10))) %>%
#   add_axis("y", title = "IndustryS1")

# mdt[!is.na(m.experience.Median)] %>%
#   ggvis(~industryS1, ~m.experience.Median) %>%
#   layer_boxplots(fillOpacity := 0.1, width = 0.5) %>% 
#   add_axis("x", title = "Payrate PA Distribution",
#            properties = axis_props(labels = list(angle = 90, align = "left", fontSize = 10))) %>%
#   add_axis("y", title = "IndustryS1")

Exponentially Wage Growth?

If we assume that Pay/Wage grew exponentially \[Pay = Base * (1 + R) ^ T\] \[Log(Pay) = T * Log(Base * (1 + R))\] With regards to years

\[Log(m.experience.Median) = Log(Pay)\]

tempt_ggplot <- filtered.IMP %>%
  ggplot(aes(x=factor(m.experience.Median), y=log(m.payrate.Mean))) +  
  geom_boxplot() +
  geom_smooth(method="lm", formula=y~x, se=FALSE, aes(group=1)) +
  geom_text(x = 30, y = 4, label = sprintf("adj_r2: %s",
        signif(
          glance(lm(m.experience.Median~log(m.payrate.Mean), filtered.IMP))$adj.r.squared,
               3)), parse = TRUE)
ggplotly(tempt_ggplot)  %>%
  layout(title = 'log(Payrate) vs Experience',
         xaxis = list(title = 'Experience (years)[range]', tickangle = 90),
         yaxis = list(title = 'log(Annual Payrate)'))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

While both plots (with log(Pay) and pay) have ~ 0 p-value, the adjusted r-square of linear plot are higher than that of log plot, although 0.52 it is still not very significant. But perhaps we can narrow down to specific industry in later sections and explore in full.

hcboxplot(x = mdt$m.payrate.Mean, 
          var = mdt$m.experience.Cut,
          var2 = mdt$industry,
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Industry")

Education vs Payrate

This section examine the payrate commanded by people with different amount of Education with respect to their level of education.

mdt.melt.Education <- melt(mdt[!is.na(m.payrate.Mean)],
     measure.vars = patterns("^m.Education.tf"), 
     na.rm=T)

hcboxplot(x = mdt.melt.Education$m.payrate.Mean, 
          var = mdt.melt.Education$m.experience.Cut, 
          var2 = mdt.melt.Education$variable,
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")

It seems that the level of education for fresh gradautes and people with up to 3 years worth of working experience does not resulted in a significant difference in payrate (ignoring outliers). owever, the difference in payscale of them may not appear significant when compare to other groups due to differences in magnitudes (y-axis).

On the next plot, the salary differences for poeple with little to no experience would be examined.

# Filtering to plot for only those jobs that listed 0 to 3 job experience.
hcboxplot(x = mdt.melt.Education[m.experience.Cut=="[0,3)"]$m.payrate.Mean, 
          var = mdt.melt.Education[m.experience.Cut=="[0,3)"]$variable,
          name = "Length", color = "#2980b9") 

It is possible that some of the job listings that depicted no experience requirement are in fact posts that required significant amount of working experience. The extracted experience requirement may not be able to capture the exact requirement if it is not listed explicitly.

Alternative accounting of Education

The previous sections considers a job that listed multiple education requirement as multiple of job listing, Ie a job listed to be application to degree holder (UG), diploma, and PG would be melted into multiple entries for each level of education.

This, in effect, would lower the actual differences of payrates between different level of education. In actual market (which means the current employer market), those with advance degree are generally heavily favoured (not to say experience doesnt matter) while those with lower level of education would find it hard to compete.

Few employees, given a choice, would be willing to underemploy/undersell themselves for position that requires lower level of education or gives a loewr payrate.

hcboxplot(x = mdt$m.payrate.Mean, 
          var = mdt$m.experience.Cut, 
          var2 = mdt$m.Education.factor,
          outliers = FALSE) %>% 
  hc_chart(type = "column") 
# table<-mdt%>%
#   .[,med.pay:=median(m.payrate.Mean), by =c("m.Education.factor", "m.experience.Cut") ]

Overall, it is quite obvious that assigning solely the single highest education requirement to individual job listing yield a much greater difference in their pay. I think this is a more accurate picture in reflecting the job market.

With regards to Industry

# industry_cout_table_40<-mdt[!is.na(m.payrate.Mean), count:.N, by= industry] %>%
#   .[,.SD[order(-N)]] %>%
#   head(., 40)
#   
# industry_cout_table_40$industry <- droplevels(industry_cout_table_40$industry)
# factor(industry_cout_table_40$industry)
# 
# mdt[industry_cout_table_40$industry,]
# 
# (colnames(mdt) %in% industry_cout_table_40$industry)

unique(mdt$industry)
##  [1] Media / Entertainment / Internet                          
##  [2] Advertising / PR / MR / Event Management                  
##  [3] IT-Software / Software Services                           
##  [4] Banking / Financial Services / Broking                    
##  [5] Aviation / Aerospace Firms                                
##  [6] Industrial Products / Heavy Machinery                     
##  [7] FMCG / Foods / Beverage                                   
##  [8] Recruitment / Staffing                                    
##  [9] Internet / Ecommerce                                      
## [10] Travel / Hotels / Restaurants / Airlines / Railways       
## [11] BPO / Call Centre / ITES                                  
## [12] Pharma / Biotech / Clinical Research                      
## [13] Real Estate / Property                                    
## [14] Insurance                                                 
## [15] Facility Management                                       
## [16] Publishing                                                
## [17] Education / Teaching / Training                           
## [18] Retail / Wholesale                                        
## [19] Automobile / Auto Anciliary / Auto Components             
## [20] Chemicals / PetroChemical / Plastic / Rubber              
## [21] Government / Defence                                      
## [22] Accounting / Finance                                      
## [23] Textiles / Garments / Accessories                         
## [24] Semiconductors / Electronics                              
## [25] Telecom/ISP                                               
## [26] Medical / Healthcare / Hospitals                          
## [27] Legal                                                     
## [28] Courier / Transportation / Freight / Warehousing          
## [29] NGO / Social Services / Regulators / Industry Associations
## [30] Other                                                     
## [31] Architecture / Interior Design                            
## [32] KPO / Research / Analytics                                
## [33] Construction / Engineering / Cement / Metals              
## [34] Office Equipment / Automation                             
## [35] Consumer Electronics / Appliances / Durables              
## [36] Iron and Steel                                            
## [37] Strategy / Management Consulting Firms                    
## [38] Oil and Gas / Energy / Power / Infrastructure             
## [39] IT-Hardware & Networking                                  
## [40] Wellness / Fitness / Sports / Beauty                      
## [41] Agriculture / Dairy                                       
## [42] Electricals / Switchgears                                 
## [43] Security / Law Enforcement                                
## [44] Gems / Jewellery                                          
## [45] Ceramics / Sanitary ware                                  
## [46] Fresher / Trainee / Entry Level                           
## [47] Food Processing                                           
## [48] Printing / Packaging                                      
## [49] Brewery / Distillery                                      
## [50] Telecom/ISP /                                             
## [51] Export / Import                                           
## [52] Heat Ventilation / Air Conditioning                       
## [53] Wellness / Fitness / Sports                               
## [54] Water Treatment / Waste Management                        
## [55] <NA>                                                      
## [56] Shipping / Marine                                         
## [57] Glass / Glassware                                         
## [58] Animation / Gaming                                        
## [59] Fertilizers / Pesticides                                  
## [60] Pulp and Paper                                            
## [61] Tyres                                                     
## [62] Leather                                                   
## [63] Broadcasting                                              
## [64] Medical Devices / Equipments                              
## 64 Levels: Accounting / Finance ...
# limited the plotting to industry with more than 100 listing
t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 150) .SD, by = industry] %>%
  droplevels(.)

hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$m.experience.Cut, 
          var2 = t.filtered.mdt$industry,
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")
# limited the plotting to industry with more than 100 listing
t.filtered.mdt <- mdt[!is.na(m.payrate.Mean), if (.N > 110 & .N<150) .SD, by = industry] %>%
  droplevels(.)

hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$m.experience.Cut, 
          var2 = t.filtered.mdt$industry,
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")
mdt[!is.na(m.payrate.Mean)&!is.na(m.experience.Median)&!is.na(industry),
    list(intercept=coef(lm(m.payrate.Mean~m.experience.Median))[1],
         coef=coef(lm(m.payrate.Mean~m.experience.Median))[2],
         p.value=summary(lm(m.payrate.Mean~m.experience.Median))$coefficients[,4] ),
    by=industry]
## # A tibble: 121 x 4
##                                    industry  intercept      coef
##                                      <fctr>      <dbl>     <dbl>
##  1         Media / Entertainment / Internet   92674.30  97973.74
##  2         Media / Entertainment / Internet   92674.30  97973.74
##  3 Advertising / PR / MR / Event Management    6816.69 158496.92
##  4 Advertising / PR / MR / Event Management    6816.69 158496.92
##  5          IT-Software / Software Services  -60679.97 164657.08
##  6          IT-Software / Software Services  -60679.97 164657.08
##  7    Industrial Products / Heavy Machinery -151875.01 140574.62
##  8    Industrial Products / Heavy Machinery -151875.01 140574.62
##  9                   Recruitment / Staffing   47271.63  88297.83
## 10                   Recruitment / Staffing   47271.63  88297.83
## # ... with 111 more rows, and 1 more variables: p.value <dbl>
mdt[!is.na(m.payrate.Mean), .N, by=industry] %>%
  .[N>10,] %>%
  .[,.SD[order(-N)]] 
## # A tibble: 41 x 2
##                                               industry     N
##                                                 <fctr> <int>
##  1                     IT-Software / Software Services  1249
##  2                            BPO / Call Centre / ITES   741
##  3              Banking / Financial Services / Broking   279
##  4       Automobile / Auto Anciliary / Auto Components   195
##  5                    Medical / Healthcare / Hospitals   179
##  6                     Education / Teaching / Training   138
##  7 Travel / Hotels / Restaurants / Airlines / Railways   128
##  8                                Internet / Ecommerce   116
##  9                              Recruitment / Staffing   115
## 10                Pharma / Biotech / Clinical Research   109
## # ... with 31 more rows
mdt[!is.na(m.payrate.Mean)&!is.na(m.experience.Median)&!is.na(industry),] %>%
  group_by(industry) %>%
  do(tidy(lm(m.payrate.Mean ~ m.experience.Median, data = .)))
## # A tibble: 121 x 6
## # Groups:   industry [61]
##                                    industry                term   estimate
##                                      <fctr>               <chr>      <dbl>
##  1                     Accounting / Finance         (Intercept)  192261.14
##  2                     Accounting / Finance m.experience.Median   92255.89
##  3 Advertising / PR / MR / Event Management         (Intercept)    6816.69
##  4 Advertising / PR / MR / Event Management m.experience.Median  158496.92
##  5                      Agriculture / Dairy         (Intercept) -263829.02
##  6                      Agriculture / Dairy m.experience.Median  145962.64
##  7                       Animation / Gaming         (Intercept)   35227.27
##  8                       Animation / Gaming m.experience.Median   52272.73
##  9           Architecture / Interior Design         (Intercept)  109488.29
## 10           Architecture / Interior Design m.experience.Median   65050.20
## # ... with 111 more rows, and 3 more variables: std.error <dbl>,
## #   statistic <dbl>, p.value <dbl>

Case study of a industry

Since we have most abundant data for the listing of jobs for “IT - Software / Software services” industry. Let’s explore this particular industry in more depth.

IT - Software / Software services

## Time of posting
# fitering for only data with valid payrate data and IT software industry
mdt.melt.location %>%
  .[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services",] %>%
  .[!is.na(m.timeCut), .N, by=m.timeCut] %>%
  ggvis(~m.timeCut, ~N) %>%
  layer_bars(fillOpacity := 0.1) %>%
  add_axis("x", properties = axis_props(
    labels = list(angle = 45, align = "left", fontSize = 10)
  ))
# # melting for education , this would create duplicate job posting for difference level of education
# mdt.melt.location.Education <- melt(mdt.melt.location, 
#      measure.vars = patterns ("^m.Education.tf."), na.rm = T, variable.name = "EducationLevel") 
# mdt.melt.location.Education$value <- NULL 
# 
# mdt.melt.location.Education <- mdt.melt.location.Education %>%
#   .[!is.na(m.payrate.Mean)|industry=="IT-Software / Software Services",]
#   
# colnames(mdt.melt.location)

t.filtered.mdt<- mdt %>%
  .[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services"] 

# depiciting only the highest educational level per job listing
hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$m.experience.Cut, 
          var2 = t.filtered.mdt$m.Education.factor, 
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")
## Highcharter plot to depict the location of job posting on map
highcharter.plot <- mdt.melt.location %>%
  .[!is.na(m.payrate.Mean)&industry=="IT-Software / Software Services",] %>%
  .[, .(.N, Paymedian=median(m.payrate.Mean), 
        Q1=quantile(m.payrate.Mean, 0.25),
        Q3=quantile(m.payrate.Mean, 0.75)), by = m.i.location] %>%
  .[order(-N)] %>%
  .[!is.na(m.i.location)] %>%
  .[mydat, on="m.i.location"]%>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30] %>%
  .[,`:=`(
           name=m.i.location,
           z=N,
           color= colorize(Paymedian)
           )] 
  
## highcharter
hcmap("countries/in/in-all", name = "India") %>% 
  hc_add_series(data = highcharter.plot, type = "mapbubble",
                name = "Job listing", maxSize = "25") %>% 
    hc_title(text = "Locations of Job Posting for IT Software") %>% 
  hc_subtitle(text = "only top 100 locations are included, the calculation allowed multiple entries if the job is listed at multiple location") %>%
hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
                                 "<tr><th>Location Name</th><td>{point.name} </td></tr>",
                                 "<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
                                 "<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
                                 "<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
             footerFormat = "</table>")
growth
# # Fit lm model using 5 x 5-fold CV: model
# model <- train(
#   m.payrate.Mean ~ m.experience.Median + EducationLevel, 
#   mdt.melt.location.Education,
#   method = "rpart", na.action = na.omit,
#   trControl = trainControl(
#     method = "cv", number = 5, verboseIter = TRUE
#   )
# )
# 
# # Print model to console
# model
# depiciting only the highest educational level per job listing
hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$m.experience.Cut, 
          var2 = t.filtered.mdt$m.Education.factor, 
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education")%>%
  hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
                                 "<tr><th>Location Name</th><td>{point.name} </td></tr>",
                                 "<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
                                 "<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
                                 "<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
             footerFormat = "</table>")

Healtcare

## Time of posting
# fitering for only data with valid payrate data and IT software industry
mdt %>%
  .[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals"] %>%
  .[!is.na(m.timeCut), .N, by=m.timeCut] %>%
  ggvis(~m.timeCut, ~N) %>%
  layer_bars(fillOpacity := 0.1) %>%
  add_axis("x", properties = axis_props(
    labels = list(angle = 45, align = "left", fontSize = 10)
  ))

This may yield false information as the trends may be results of data subseting, though, it certainly shows that most of the job listing for healthcare occured around May, March for 2016.

## Highcharter plot to depict the location of job posting on map
highcharter.plot <- mdt.melt.location %>%
  .[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals",] %>%
  .[, .(.N, Paymedian=median(m.payrate.Mean), 
        Q1=quantile(m.payrate.Mean, 0.25),
        Q3=quantile(m.payrate.Mean, 0.75)), by = m.i.location] %>%
  .[order(-N)] %>%
  .[!is.na(m.i.location)] %>%
  .[mydat, on="m.i.location"]%>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30] %>%
  .[,`:=`(
           name=m.i.location,
           z=N,
           color= colorize(Paymedian)
           )] 
  
# Highcharter
hcmap("countries/in/in-all", name = "India") %>% 
  hc_add_series(data = highcharter.plot, type = "mapbubble",
                name = "Job listing", maxSize = "25") %>% 
    hc_title(text = "Locations of Job Posting for Healthcare") %>% 
  hc_subtitle(text = "only top 100 locations are included, the calculation allowed multiple entries if the job is listed at multiple location") %>%
hc_tooltip(useHTML = TRUE,
             headerFormat = "<table>",
             pointFormat = paste("<tr><th colspan=\"1\">N Job Listing<td>{point.z}</td></th></tr>",
                                 "<tr><th>Location Name</th><td>{point.name} </td></tr>",
                                 "<tr><th>Pay Median</th><td>{point.Paymedian} </td></tr>",
                                 "<tr><th>Pay Q1</th><td>{point.Q1} </td></tr>",
                                 "<tr><th>Pay Q3</th><td>{point.Q3} </td></tr>"),
             footerFormat = "</table>")
growth
t.filtered.mdt<-mdt %>%
  .[!is.na(m.payrate.Mean)&industry=="Medical / Healthcare / Hospitals"] 

hcboxplot(x = t.filtered.mdt$m.payrate.Mean, 
          var = t.filtered.mdt$m.experience.Cut, 
          var2 = t.filtered.mdt$m.Education.factor,
          outliers = FALSE) %>% 
  hc_chart(type = "column") %>%
  hc_title(text = "Plot of Payrate(Mean, PA) vs Experience (Yrs) segregated on Education") %>%
  hc_subtitle(text = "the extra series is allocated for job listing with no clear education requirement")

Location difference

require(leaflet)
## Loading required package: leaflet
## Warning: package 'leaflet' was built under R version 3.3.3
software.icons <- awesomeIcons(
  icon = 'fa-github',
  library = 'fa'
)
internet.icons <- awesomeIcons(
  icon = 'fa-internet-explorer',
  library = 'fa'
)
bank.icon <- awesomeIcons(
  icon = 'fa-institution',
  library = 'fa'
)
factory.icon <- awesomeIcons(
  icon = '  fa-industry',
  library = 'fa'
)
recruit.icon <- awesomeIcons(
  icon = 'fa-handshake-o',
  library = 'fa'
)
hospital.icon <- awesomeIcons(
  icon = 'fa-h-square ',
  library = 'fa'
)
pharma.icon <- awesomeIcons(
  icon = 'fa-medkit',
  library = 'fa'
)
phone.icon <- awesomeIcons(
  icon = 'fa-phone',
  library = 'fa'
)

# join mdt.melt.locaiont ( which duplicates for location ) with geocoded address(mydat)
combined<- mdt.melt.location[,if(.N>1000).SD, by=industry] %>%
  .[,if(.N>200).N, by=.(industry, m.i.location)] %>%
  .[mydat, on="m.i.location"] %>%
  .[!is.na(industry)] %>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30]%>%
  droplevels(.) 

list_industry<-unique(combined$industry)


individual_listing <- mdt.melt.location[, .(industry, m.i.location)] %>%
  .[industry %in% list_industry] %>%
  .[,.N, by=c("industry", "m.i.location")]%>%
  .[mydat, on="m.i.location"] %>%
  .[!is.na(lon)|!is.na(lat)]%>%
  .[lon>60&lon<90] %>%
  .[lat>5&lat<30]%>%
    droplevels(.)

individual_listing[industry=="IT-Software / Software Services"]
## # A tibble: 67 x 6
##                           industry m.i.location     N   i.N      lon
##                             <fctr>        <chr> <int> <int>    <dbl>
##  1 IT-Software / Software Services    Bengaluru  4501  9760 77.59456
##  2 IT-Software / Software Services    Bangalore  3429  7335 77.59456
##  3 IT-Software / Software Services       Mumbai  1939  5959 72.87766
##  4 IT-Software / Software Services    Hyderabad  2145  4570 78.48667
##  5 IT-Software / Software Services        Delhi  1349  4103 77.10249
##  6 IT-Software / Software Services      Chennai  1284  3369 80.27072
##  7 IT-Software / Software Services Secunderabad  1440  2858 78.49827
##  8 IT-Software / Software Services        Noida  1274  2063 77.39103
##  9 IT-Software / Software Services      Gurgaon   772  1909 77.02664
## 10 IT-Software / Software Services         Pune   521  1228 73.85674
## # ... with 57 more rows, and 1 more variables: lat <dbl>
leaflet(individual_listing) %>% addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron)%>%
  #overlay groups
# addAwesomeMarkers(~lon, ~lat, icon=hospital.icon, label=~industry,
#                   data =individual_listing[industry=="Medical / Healthcare / Hospitals"],
#                   clusterOptions = T, group ="Healthcare") %>%
# addAwesomeMarkers(~lon, ~lat, icon=software.icons, label=~as.character(industry),
#                   data =individual_listing[industry=="IT-Software / Software Services"],
#                   clusterOptions = T, group = "IT") %>%
# addAwesomeMarkers(~lon, ~lat, icon=phone.icon, label=~as.character(industry),
#                   data =individual_listing[industry=="BPO / Call Centre / ITES"],
#                   clusterOptions = T, group ="Call Centre")  %>%
# addAwesomeMarkers(~lon, ~lat, icon=bank.icon, label=~as.character(industry),
#                   data =individual_listing[industry=="Strategy / Management Consulting Firms"],
#                   clusterOptions = T, group = "Strategy") %>%
# addAwesomeMarkers(~lon, ~lat, icon=recruit.icon, label=~as.character(industry),
#                   data =individual_listing[industry=="Recruitment / Staffing"],
#                   clusterOptions = T, group = "Recruit") %>%
addCircles(~lon, ~lat, stroke = F, group = "recruit",
           data =individual_listing[industry=="IT-Software / Software Services"] ) %>%
  # Layers control
  addLayersControl(
    overlayGroups = c("Medical", "IT", "Call Centre", "Strategy", "Recruit"),
    options = layersControlOptions(collapsed = FALSE)
  )
individual_listing[industry=="IT-Software / Software Services"]
## # A tibble: 67 x 6
##                           industry m.i.location     N   i.N      lon
##                             <fctr>        <chr> <int> <int>    <dbl>
##  1 IT-Software / Software Services    Bengaluru  4501  9760 77.59456
##  2 IT-Software / Software Services    Bangalore  3429  7335 77.59456
##  3 IT-Software / Software Services       Mumbai  1939  5959 72.87766
##  4 IT-Software / Software Services    Hyderabad  2145  4570 78.48667
##  5 IT-Software / Software Services        Delhi  1349  4103 77.10249
##  6 IT-Software / Software Services      Chennai  1284  3369 80.27072
##  7 IT-Software / Software Services Secunderabad  1440  2858 78.49827
##  8 IT-Software / Software Services        Noida  1274  2063 77.39103
##  9 IT-Software / Software Services      Gurgaon   772  1909 77.02664
## 10 IT-Software / Software Services         Pune   521  1228 73.85674
## # ... with 57 more rows, and 1 more variables: lat <dbl>
individual_listing[industry=="Medical / Healthcare / Hospitals"]
## # A tibble: 0 x 6
## # ... with 6 variables: industry <fctr>, m.i.location <chr>, N <int>,
## #   i.N <int>, lon <dbl>, lat <dbl>
unique(individual_listing$industry)
##  [1] IT-Software / Software Services              
##  [2] Recruitment / Staffing                       
##  [3] Banking / Financial Services / Broking       
##  [4] Internet / Ecommerce                         
##  [5] Pharma / Biotech / Clinical Research         
##  [6] BPO / Call Centre / ITES                     
##  [7] Education / Teaching / Training              
##  [8] Automobile / Auto Anciliary / Auto Components
##  [9] Strategy / Management Consulting Firms       
## [10] Construction / Engineering / Cement / Metals 
## 10 Levels: Automobile / Auto Anciliary / Auto Components ...
leaflet(individual_listing) %>% addTiles() %>%  
  addProviderTiles(providers$CartoDB.Positron)%>%
   #overlay groups
addCircleMarkers(~lon, ~lat, popup=~industry,
                  data =individual_listing[industry=="IT-Software / Software Services"],
                  clusterOptions = T, group ="IT") %>%
addCircleMarkers(~lon, ~lat, popup=~industry,
                  data =individual_listing[industry=="Recruitment / Staffing"],
                  clusterOptions = T, group ="Recruit") %>%
addCircleMarkers(~lon, ~lat, popup=~industry,
                  data =individual_listing[industry=="Banking / Financial Services / Broking"],
                  clusterOptions = T, group ="Bank") %>%
addCircleMarkers(~lon, ~lat, popup=~industry,
                  data =individual_listing[industry==list_industry[[4]]],
                  clusterOptions = T, group ="Internet") %>%
addCircleMarkers(~lon, ~lat, popup=~industry,
                  data =individual_listing[industry==list_industry[[5]]],
                  clusterOptions = T, group ="Pharma") %>%
  # Layers control
  addLayersControl(
    overlayGroups = c("IT", "Recruit", "Bank", "Internet", "Pharma"),
    options = layersControlOptions(collapsed = FALSE)
  )
pal <- colorFactor("viridis", domain = unique(individual_listing$industry))


leaflet(individual_listing) %>% addTiles() %>%  
  addProviderTiles(providers$CartoDB.Positron)%>%
  addCircles(lng = ~lon, lat = ~lat, weight = 10,
    radius = ~N*10,
    color = ~pal(industry), 
    popup = ~m.i.location
  )%>%
  addLegend("bottomright", pal = pal, values = ~industry,
    title = "Industry",
    opacity = 1
  )